DeepComBat: A statistically motivated, hyperparameter‐robust, deep learning approach to harmonization of neuroimaging data

Abstract Neuroimaging data acquired using multiple scanners or protocols are increasingly available. However, such data exhibit technical artifacts across batches which introduce confounding and decrease reproducibility. This is especially true when multi‐batch data are analyzed using complex downstream models which are more likely to pick up on and implicitly incorporate batch‐related information. Previously proposed image harmonization methods have sought to remove these batch effects; however, batch effects remain detectable in the data after applying these methods. We present DeepComBat, a deep learning harmonization method based on a conditional variational autoencoder and the ComBat method. DeepComBat combines the strengths of statistical and deep learning methods in order to account for the multivariate relationships between features while simultaneously relaxing strong assumptions made by previous deep learning harmonization methods. As a result, DeepComBat can perform multivariate harmonization while preserving data structure and avoiding the introduction of synthetic artifacts. We apply this method to cortical thickness measurements from a cognitive‐aging cohort and show DeepComBat qualitatively and quantitatively outperforms existing methods in removing batch effects while preserving biological heterogeneity. Additionally, DeepComBat provides a new perspective for statistically motivated deep learning harmonization methods.

batch effects remain detectable in the data after applying these methods.We present DeepComBat, a deep learning harmonization method based on a conditional variational autoencoder and the ComBat method.DeepComBat combines the strengths of statistical and deep learning methods in order to account for the multivariate relationships between features while simultaneously relaxing strong assumptions made by previous deep learning harmonization methods.As a result, DeepComBat can perform multivariate harmonization while preserving data structure and avoiding the introduction of synthetic artifacts.We apply this method to cortical thickness measurements from a cognitive-aging cohort and show DeepComBat qualitatively and quantitatively outperforms existing methods in removing batch effects while preserving biological heterogeneity.Additionally, DeepComBat provides a new perspective for statistically motivated deep learning harmonization methods.
However, in studies where feature-level data are used in a highly multivariate manner, univariate harmonization approaches may be insufficient.For example, as imaging researchers have become more interested in complex prediction efforts, multivariate feature datasets are used as inputs to predict an outcome of interest.In these settings, state-of-the-art machine learning (ML) algorithms are often used as powerful approaches that are able to jointly leverage the multivariate distribution of features, accounting for complex nonlinear and interaction effects (Hu et al., 2023;Koutsouleris et al., 2014;Smith et al., 2017;Wager et al., 2013).Batch effects that exist in the interactions between features may also be picked up by these ML algorithms, which can lead to decreased generalizability of these models and overfitting of model parameters on batch effects, especially when batch status is a relevant confounder for the outcome.Thus, recent efforts in feature-level harmonization have attempted to detect and mitigate such multivariate batch effects.
From the statistical perspective, recently proposed methods for multivariate harmonization have included CovBat (Chen et al., 2022), Bayesian factor regression (BFR, Avalos-Pacheco et al., 2022), and UNIFAC (Zhang et al., 2022).Like ComBat, these models assume batch effects can be effectively modeled through the combination of low-rank additive and multiplicative effects.However, instead of modeling batch effects solely in a univariate manner, CovBat additionally assumes batch effects to be present in the covariance structure of model residuals, while BFR and UNIFAC assume additive batch effects to be present in the direction of multivariate latent factors.Additionally, while ComBat, CovBat, and UNIFAC all seek to ultimately produce a dataset of harmonized features, BFR instead learns a lowdimensional representation of the original features where batch effects have been removed; BFR does not map this low-dimensional representation back to the feature space.
From the deep learning perspective, feature-level multivariate harmonization methods have leveraged the conditional variational autoencoder (CVAE) architecture, an adaptation of the standard variational autoencoder that attempts to disentangle the latent space distribution from covariates of interest (Kingma & Welling, 2014;Sohn et al., 2015).These models include diffusion CVAE (dcVAE, Moyer et al., 2020) and goal-specific CVAE (gcVAE, An et al., 2022).In dcVAE, an encoder embeds vector representations of diffusion MRI data as latent space distributions, and the encoder is penalized when batch-specific information is present in the latent space representation.Then, the decoder is given these latent space distributions along with explicit batch information and trained to reconstruct the original input.Through this process, dcVAE assumes that the encoder can learn to remove batch effects and the decoder can accurately reconstruct the original data while removing batch effects.However, An et al. (2022) noted that dcVAE removes biological information of interest.They proposed gcVAE could recover this biological information by fine-tuning the dcVAE decoder such that the decoder could not only accurately reconstruct the input but could also retain biological information of interest in the reconstruction.gcVAE encourages this behavior by adding a pre-trained neural network classifier to the end of dcVAE that uses decoder output to predict biological covariates of interest.Classifier success is rewarded in the loss function.
Additionally, a deep learning network called flow-based structural causal model has been explored for feature-level harmonization (Wang et al., 2021).This method seeks to learn the causal effect of batch, conditional on biological covariates, and sample from the posterior distribution under the counterfactual batch with the goal to improve predictive performance of a model trained in a reference dataset.
Notably, deep learning harmonization methods designed for feature-level data make a number of strong implicit assumptions.First, deep learning harmonization methods directly use model outputs from the harmonization step as the resulting harmonized data-unmodeled residual terms are unaccounted for, as well as any batch or biological effects in these residuals.Implicitly, this makes the strong assumption that the deep learning method achieves perfect or nearly perfect model fit-that is, the reconstruction loss is zero or nearly zero.This is in contrast to statistical harmonization methods, which tend to estimate batch effects within unmodeled residual as a difference in scale; the residuals are rescaled and added back to the model-based biological effects to produce the resulting harmonized data.Secondly, deep learning harmonization methods assume that batch and biological effects can be completely disentangled through loss function optimization and choice of network architecture.While this may be easily achievable in isolation, complete disentanglement may be challenging to achieve in conjunction with the implicit nearly perfect model fit assumption.Third, these deep learning methods use loss functions that are limited to harmonization between only two batches-settings where multiple batches exist are infeasible.Finally, deep learning harmonization methods often do not explicitly consider that biological covariates may be imbalanced across batches-in such cases, some differences across batches may actually be due to true biological differences and therefore should not be removed.
We propose a novel deep learning harmonization method, called DeepComBat, that effectively removes multivariate batch effects between two or more batches in a statistically informed manner.
Compared to statistical methods such as ComBat and CovBat, Deep-ComBat removes complex, nonlinear, and multivariate batch effects from the raw data in a way that mitigates detection of batch effects using highly multivariate methods.Compared to other deep learning methods, DeepComBat avoids making the assumptions described above-unmodeled residual terms are reintroduced, a completely disentangled latent space is not required, multiple batches can be harmonized, and model-based batch effects are removed conditional on biological covariates that may be confounders.To the best of our knowledge, DeepComBat is the first deep learning harmonization method that avoids such assumptions.Additionally, DeepComBat hyperparameters can be easily tuned manually, and DeepComBat can be thought to have a form of "double-robustness" such that even with poor model fit, reasonable harmonization can still be achieved.
Informed consent was obtained for all subjects in the ADNI study.
Institutional review boards approved the study at all of the contributing institutions.
We define three batches based on which manufacturer each subject's scanner was from.This included Siemens Healthineers (n = 280), General Electric (GE) Healthcare scanners (n = 287), and Philips Medical Systems (n = 96).For the purpose of Supplemental Materials where we compare DeepComBat to dcVAE and gcVAE, which can only harmonize between two batches, we define Siemens and non-Siemens as the two batches.Additionally, we define age, sex, and Alzheimer disease (AD) status (cognitively normal, late mild cognitive impairment [MCI], AD) as biological covariates of interest that may confound the relationship between batch status and T1w imaging-these covariates are known to affect brain structure and also may be associated with scanner manufacturer through differing population demographics across sites.Subject demographics at time of most recent acquisition are presented in Table 1, stratified by these two batches.Notably, there are marked differences in the distribution of sex across the three batches, suggesting that confounding of batch status by subject demographics is plausible, and estimation of batch effects should be conditioned on subject demographics.
Processing of these data was carried out using the Advanced Normalization Tools longitudinal single-subject template pipeline (Tustison et al., 2019).Briefly, we first downloaded raw T1w images from the ADNI-1 database, which were acquired using MPRAGE for Siemens and Philips scanners and using a works-in-progress version of MPRAGE for GE scanners (Jack Jr. et al., 2010).For each subject, we estimated a single-subject template using all image timepoints and applied rigid spatial normalization to this template for each timepoint image.Then, each normalized timepoint image is processed using the single-image cortical thickness pipeline consisting of (1) brain extraction (Avants et al., 2010), (2) denoising (Manjón et al., 2010), (3) N4 bias correction (Tustison et al., 2010), (4) Atropos n-tissue segmentation (Avants et al., 2011), and (5) registration-based cortical thickness estimation (Das et al., 2009).Finally, for our analyses, we used cortical thickness values for the 62 Desikan-Killiany-Tourville atlas regions such that the feature matrix we sought to harmonize was of dimension 663 Â 62 (Klein & Tourville, 2012).Scan metadata were determined based on information contained within the Digital Imaging and Communications in Medicine headers for each scan.

| ComBat model
We first review the ComBat (Combatting Batch Effects) model, which models additive and multiplicative batch effects in an empirical Bayes framework (Fortin et al., 2017;Johnson et al., 2007).This model is used as a building block for DeepComBat.For each subject, let information for that subject, where each y ijk is a scalar.In this notation, i ¼ 1,2,…, B indexes batch; j ¼ 1,2, …, n i indexes subjects within batch i, where n i is the number of subjects acquired in batch i; and k ¼ 1, 2,…, p indexes features, where p is the total number of features.
First, ComBat is fit on each feature individually using the following model: where α k is the vector of shared intercepts across batches; x ij is the vector of subject-specific biological covariates; β k is the vector of regression coefficients for the covariates; γ ik is the vector of mean batch effects for batch i conditional on the covariates; and δ ik is the vector of multiplicative batch effects on the residuals.ComBat assumes the errors, e ijk , are distributed N 0,σ 2 k À Á .
For each individual feature, least-squares estimates b α k and b β k are obtained.Then, to estimate batch effects using empirical Bayes, Com-Bat assumes the additive batch effects, γ ik , are drawn from a normal distribution prior and the multiplicative batch effects, δ ik , are drawn from an inverse gamma distribution prior.Hyperparameters for these priors are estimated via method of moments using data across all features.Next, for each feature-level, empirical Bayes estimates, γ Ã iv and δ Ã iv , are obtained as the means of their corresponding posterior distributions.This results in shrinkage estimators for both the additive and multiplicative batch effects such that these effects can be wellestimated even when within-batch sample size is small.Finally, estimated batch effects are removed using the following equation: ik is the subject-specific mean as estimated by the ComBat model.

| DeepComBat method
We propose two versions of DeepComBat-one for internal harmonization and one external harmonization.We define internal harmonization as harmonization in settings where the entire dataset is available and the goal is to remove confounding batch effects for univariate or  however, the linear transformations of features are stored such that they can be inverted, and the harmonized output will remain in the original feature space.

| Architecture
In the CVAE training step, normalized ADNI cortical thickness data are passed through a standard, fully connected CVAE-style model with the architecture shown in Figure 1.For architectural hyperparameters, the latent space was empirically chosen to be approximately one-fourth the size of the input vector, rounded to the nearest power of 2-in practice, latent spaces approximately one-eight or one-half the size of the input vector also performed similarly.Four hidden layers were used on either side of the latent space to allow for sufficient complexity of the encoder to learn meaningful latent space representations with minimal batch effects and of the decoder to incorporate batch effects in reconstruction.Hidden layer sizes were defined such that each size was approximately halfway between the size of the layers before and after.Hyperbolic tangent (TanH) activation functions were used between layers to allow for nonlinearitythe TanH activation function was empirically found to perform better than rectified linear units in this application.
One iteration through the CVAE for one subject is as follows.
First, let the encoder input be defined as the column-wise concatenation of the column vectors y ij , x ij , and b ij .This encoder input is passed through successive hidden layers until it is eventually encoded into , where Þ represent the encoder functions with neural network parameters θ 1 and θ 2 , respectively.These vectors together define a multivariate normal random variable, This random variable is the output of the encoder and can be thought of as subject ij's latent space representation.Next, to begin the decoding step, a sample is drawn from this random variable using the reparameterization trick in order to obtain z ij (Kingma & Welling, 2014).As with the encoder input, this sample is column-wise concatenated with x ij and b ij to produce the decoder input.Then, it is passed through the decoder hidden and output layers to obtain a reconstructed feature , where q ϕ Á, Á , Á ð Þ is the decoder function with neural network parameters ϕ.Note that this reconstructed feature vector is a function of the sample from the random variable Z ij and thus changes each time subject ij is passed through the CVAE.
Notably, while it may be unnecessary to provide the encoder with biological and batch covariates, we thought providing such covariates could be useful to the encoder for learning a covariate-invariant latent space.
Thus, the latent space distribution, Z ij , is a function of the features, y ij , as well as the covariates x ij and b ij .Similarly, the recon- is a function of the latent distribution through z ij , as well as the covariates x ij and b ij .Additionally, by giving the decoder random samples from the latent space distribution, the decoder learns that probabilistically nearby points in the latent space should be mapped to similar outputs in the feature space.That is, the decoder learns to reconstruct the features such that b The risk of overfitting by the decoder is also minimized, as this random sampling functions as a form of data augmentation with respect to the decoder.

| Loss function
The loss function was defined to be the standard CVAE loss function which consists of an autoencoder reconstruction loss component and a Kullback-Leibler (KL) divergence loss component (Kingma & Welling, 2014;Sohn et al., 2015).In the DeepComBat CVAE, this loss function is implemented for each subject as follows: and g Z ð Þ, which is defined in DeepComBat to be the probability density function of the standard multivariate normal distribution, N 0, I ð Þ.
The overall loss function is defined as the sum over all subjects: The KL divergence term can be thought to enforce a standard normal Bayesian prior on the latent space, where λ represents the strength of the prior.Thus, the KL divergence term allows for regularization of the latent space as well as encourages removal of information that is unnecessary for reconstruction from the latent space.In the DeepComBat CVAE, since biological and batch covariates are explicitly given to the decoder, optimal latent space representations should contain no information about these covariates and instead encode richer, subject-specific information.Practically, this complete independence may be unrealistic to achieve, and this is accounted for during the harmonization step.
Importantly, while biological and batch covariates are used as inputs for both the encoder and the decoder, the CVAE is not rewarded for including information about these covariates in the loss function.This design choice prevents the CVAE from introducing bias, but still allows the model to learn multivariate batch effects conditional on potential biological confounders.

| Optimization and hyperparameter tuning
This CVAE loss function is known to have the potential to suffer from KL vanishing, also referred to as posterior collapse, where a local minimum of the loss function is reached and the model cannot improve (Bowman et al., 2016).In KL vanishing, the encoder learns to collapse all latent space representations to the standard normal prior such that the KL component of the loss function is nearly zero, and the decoder is given total noise and is therefore unable to learn anything in order to make progress toward further minimizing the loss.To minimize risk of posterior collapse in the DeepComBat CVAE, we utilize a cyclic annealing optimization schedule (Fu et al., 2019).In this schedule, λ is gradually increased from 0 to the goal final KL divergence weight multiple times over the course of model training.This provides opportunities for the optimizer to escape local minimum when λ is small and allows for progressive learning of more meaningful latent representations across cycles.
In DeepComBat, we perform manual hyperparameter tuning to determine our desired final λ Final ¼ 0:1.The goal in tuning λ Final is to impose a prior that is strong enough to regularize the latent space and Euclidean distance between latent space representations are meaningful, but weak enough to allow for rich, subject-specific information to be encoded in the latent space in order to produce highquality reconstructions.
Using this λ Final , we first pre-train the CVAE for 5 epochs with For internal DeepComBat, optimization was performed using the Adam optimizer with learning rate of 0.01, chosen to increase the initial rate of model convergence (Kingma & Ba, 2017).For external DeepComBat, optimization was performed using the AdamW optimizer, a version of Adam which shrinks parameter estimates toward 0 and therefore introduces regularization and reduces overfitting (Loshchilov & Hutter, 2019).For AdamW, we used a learning rate of 0.01 and default weight decay.Within epochs, data were passed to the CVAE in mini-batches of 64 subjects.Additionally, in settings with larger numbers of features, such as with functional connectivity measures, the DeepComBat package allows for user specification of number of hidden layers as well as hidden layer and latent space sizes.Decreasing these hyperparameters will decrease the computational time necessary to train DeepComBat; however, other than to improve computational efficiency, these hyperparameters do not need to be tuned.Empirically, in the ADNI dataset, DeepComBat performance was robust to such changes in hyperparameters.

| Harmonization
Once the CVAE model has been trained, harmonization can be performed on the latent space, the CVAE decoder, and the reconstruction residuals, as shown in Figure 1.In the latent space, each subject's noisy latent space distribution, Z ij , is converted to the noiseless latent space mean vector, μ ij .Then, across all ij subjects, the ComBat model described above is fitted using both batch and biological covariates to harmonize the latent space.Let each ComBat-harmonized latent space representation be denoted as: Next, the decoder output is harmonized.In this step, the decoder input is changed such that it receives harmonized latent space mean vectors as well as the desired batch for the harmonized data.The decoder additionally continues to receive unchanged biological covariates.
Then, the reconstruction residuals are calculated and harmonized.
To estimate these residuals, noiseless reconstructions are first estimated by giving the decoder latent space mean vectors instead of the latent space distribution samples used during CVAE training.Then, reconstruction residuals are defined as the difference between reconstructions and the original data.These residuals are then corrected across all subjects using the ComBat model with both batch and biological covariates.

| Evaluation
DeepComBat was evaluated against unharmonized data as well as other feature-level harmonization methods where code was available.
These methods included ComBat and CovBat in the main manuscript, and ComBat, CovBat, dcVAE, and gcVAE in the Supplemental Materials (An et al., 2022;Chen et al., 2022;Fortin et al., 2017;Moyer et al., 2020).Notably, since no code was provided in the original manuscript for dcVAE, we implemented this method using code provided by An et al. (2022).For all comparison methods, we used default settings and hyperparameters provided in the code.Biological covariates of age, sex, and AD status were provided for ComBat, CovBat, and DeepComBat.Evaluation was conducted using qualitative visualization, statistical testing, and ML experiments.
In statistical testing and ML experiments, we assess the presence of both batch effects and biological effects.When assessing for batch effects, we assume that (1) test statistics corresponding to large pvalues for statistical tests and (2) worse performance in predicting batch for ML experiments correspond to less presence of batch effects and therefore better harmonization.However, when assessing for biological effects, effective harmonization may lead to better, worse, or similar results, depending on the underlying relationship between batch and biological covariates.

| Qualitative visualization
We visualize the overall multivariate distribution of unharmonized and harmonized feature matrices using Unifold Manifold Approximation and Projection (UMAP) and principal component analysis (PCA) (McInnes et al., 2020).UMAP was fit using the umap package in R with 20 neighbors, 100 epochs, and default settings otherwise.Points were displayed by batch status.PCA was fit on correlation matrices to account for differences in scale across features.For UMAP and PCA, arbitrary differences in sign due to model fitting were changed in order to improve direct comparability of these visualizations between methods.Additionally, we explore how harmonization methods act on a small random sample of features using bivariate density plots and plots of feature-level changes after harmonization.

| Statistical testing
Harmonization methods were evaluated using mass univariate and multivariate statistical testing.For mass univariate testing, we performed two-sample Anderson-Darling test on each feature, between pairwise batches, resulting in three p-values per feature.Average pvalue across all features and pairwise batches, as well as its standard deviation is reported.
To test for differences in feature-wise means across batch as well as assess for validity of downstream analyses on biological  (Olson, 1974).
For nonparametric multivariate testing, we use the k-nearestneighbor batch-effect test (kBET) metric with 500 repeats and default settings otherwise (Büttner et al., 2019).The kBET test is a nonpara-

| ML experiments
To evaluate how our method interacts with multivariate batch or biological effects, we train ML algorithms to predict covariate information using the harmonized feature matrix.Prediction models were independently trained to perform classification of batch, sex, and AD status, as well as regression of age.To perform the ML experiments, we use the caret package, version 6.0-93, to train and assess a large battery of ML algorithms on each feature matrix using the repeated cross-validation strategy, with 10 repeats of 10-fold cross-validation to estimate the out-of-sample predictive performance.
For two-class classification sex, average area under the receiver operating characteristic curve (AUROC) across validation sets is reported.For three-class classification of batch and AD, AUROC cannot be calculated, so average accuracy across validation sets is reported.For regression of age, average R 2 values across validation sets is reported.Note that in the repeated cross-validation strategy, average cross-validation metrics can be made arbitrarily precise by increasing the number of repeats, but variation in these metrics occurs within each cross-validation fold due to randomness in train-validation splitting and ML model fitting.
The ML evaluation battery for classification tasks consisted of: support vector machine (SVM) with radial basis, quadratic discriminant analysis (QDA), KNNs, random forest (RF), and Extreme Gradient Boosted trees (XGBoost).The ML evaluation battery for regression of age consisted of: SVM with radial basis, KNN, RF, and XGBoost.SVM, QDA, KNN, and RF were fit using the default hyperparameters provided by their corresponding R packages.For XGBoost a few hyperparameters were a priori changed from the default to allow for greater algorithm differences when compared to RF.These changed hyperparameters included: eta = 0.1 and colsample_bytree = 0.5.
Number of total boosting rounds had no default and was set to 100.
Other hyperparameters were set to their defaults.

| DeepComBat reduces batch effects in qualitative visualizations
We visualized the effect of DeepComBat univariately and multivariately for internal and external DeepComBat.For a representative, randomly sampled region's cortical thickness, density plots by batch revealed differences in distribution across batch in the raw data that could be attributed to differences in mean, variance, and shape (Figures 2a and 3a).This distributional difference was qualitatively

| DeepComBat removes statistically detectable batch effects and preserves inference on biological effects
Harmonization performance was assessed using parametric and nonparametric testing.Feature-wise linear regression results are presented in Table 2 as average negative log 10 p-values and in Supplemental Figures 3 and 4 as quantile-quantile plots of negative log 10 p-value distributions.On the raw data, this analysis showed significant differences in mean across batch, when biological covariates were included in the model.These differences were effectively removed by ComBat, CovBat, and DeepComBat.In statistical testing for multivariate mean effects using MANOVA, ComBat, CovBat, and DeepComBat, completely removed batch effects from the multivariate mean across features when biological covariates were also included (Table 2).
In both univariate and multivariate analyses, ComBat, CovBat, and DeepComBat preserved inference on biological covariates of age, sex, and AD status.Notably, DeepComBat was slightly more powerful than ComBat and CovBat for detecting almost all biological covariate effects in both internal and external harmonization settings.These increases in power for inference may reflect removal of batchattributable confounding.
In nonparametric testing, average feature-wise Anderson-Darling test results showed significant differences in univariate distributions across batches in the raw data (Table 3).These differences were effectively reduced by ComBat, CovBat, and DeepComBat.These results are further illustrated in Supplemental Figures 3 and 4 3).In contrast, kBET detected highly significant differences for raw data and ComBat.
In the two-batch setting, dcVAE and gcVAE increased the statistical difference in batch-wise univariate means and did not eliminate differences in multivariate means (Supplemental Table 1).Meanwhile, dcVAE and gcVAE showed large increases in univariate power for age and AD effects and a large decrease in power for sex effects.
These patterns may be explained by the exclusion of unmodeled residuals in dcVAE and gcVAE harmonized outputs, since statistical power is a function of both the effect size and the variance of residuals.dcVAE and gcVAE also performed poorly in nonparametric testing, according to both Anderson-Darling tests and kBET (Supplemental Table 2).These findings are qualitatively supported in Supplemental Figure 5.

| DeepComBat impairs detection of batch by ML algorithms and maintains predictability of biological covariates
A battery of ML experiments seeking to predict batch status were run on raw and harmonized data (Figures 4a and 5a).All classifiers could effectively determine the batch status of out-of-sample subjects in the raw data.This ability to detect batch was greatly decreased by all harmonization methods, with DeepComBat-harmonized data consistently corresponding to the lowest mean accuracy, especially for the ML algorithms with overall higher ability to discriminate batch.
Additionally, DeepComBat effectively retained biological information in its outputs.In Figure 4  These values ranged from 16 times greater than the λ Final used in our primary analysis to values 16 times less.Loss profiles for these robustness analyses are presented in Supplemental Figure 1.Harmonization results from these analyses are presented in Supplemental Tables 3   and 4 and Supplemental Figures 7-9.Overall, results for the highest value of λ Final , tuning using either cyclic annealing or constant scheduling, showed superior harmonization according to ML experiments and similar performance according to statistical tests.
This robustness result may be due to the design of DeepComBat, which partitions batch effects originally present in the raw data into one of these three components: the CVAE latent space, the CVAE decoder, and the reconstructed residuals.Component-wise harmonization therefore allows for a form of "double-robustness" with respect to CVAE fitting-if λ Final is too large such that most batch effects are contained in the reconstruction residuals, ComBat on these residuals will still allow for reasonable overall harmonization; and if λ Final is too small such that most batch effects are contained in the latent space, ComBat on the latent space will address the batch effects.

| DISCUSSION
Multi-batch neuroimaging data are increasingly common and necessary for learning generalizable models for inference and prediction.There is also growing interest in using ML techniques to perform multivariate pattern analysis and train powerful classifiers that can efficiently use multivariate data.To enable these efforts, there is increasing need for statistically rigorous multivariate harmonization methods.
In this study, we demonstrate that strong batch effects exist in raw data, and that these batch effects remain detectable by ML experiments even after state-of-the-art statistical harmonization methods are applied.We also find that, while previously proposed deep learning harmonization approaches are able to partially remove batch effects from the ADNI dataset, this batch effects correction comes at the cost of removal of relevant biological information as well as introduction of artifacts characteristic of synthetic data.We then propose DeepComBat, a novel hybrid method that is able to take advantage of the strengths of both deep learning and statistical methods-it uses the CVAE architecture to perform nonlinear, multivariate correction as well as the ComBat model to rigorously and robustly harmonize the latent space and residuals.magnitude of biological and batch effects, as well as on the nature of confounding between batch and biology (Hu et al., 2023).

| DeepComBat performance
For example, in this dataset, sex is observed to be imbalanced across the batches such that knowing a subject's batch provides information regarding the subject's sex.In this setting, if the ground truth sex effect is small relative to the batch effect, a classifier for sex trained on harmonized data should theoretically perform worse than the same classifier trained on raw data, and this pattern is empirically observed for ComBat, CovBat, and DeepComBat in Table 2 and Figures 4b and 5b.However, since no ground truth for postharmonization sex effect magnitude exists, it is unclear how much reduction in predictive performance for sex is accurate and therefore which method best recovers the ground truth.Similarly, an increase in estimates of biological effects may occur if minimal imbalance is present, but batch effect introduces bias or noise that covers up a smaller biological signal.This may occur in settings where one batch produces features of much higher quality than another, such as if one batch corresponds to higher Tesla imaging compared to another.Finally, no change in estimated biological effects may be expected if neither situation occurs or if both occur but balance each other out.However, regardless of how estimated biological effects change after harmonization, improved removal of batch effects leads to more generalizable and reproducible inference, since results can be trusted to be free of complex batch-related confounding structures.
Additionally, we empirically find that, in our dataset, higher values Overall, these results suggest that DeepComBat may be especially useful for harmonization in settings where prediction or inference using multivariate features and multivariate methods is the goal.In these settings, feature-wise correction using statistical methods may lead to significant non-corrected batch effects that may be picked up by prediction methods and inappropriately used.
4.2 | DeepComBat may be more robust to model misspecification when compared to statistical methods Similarly to many statistical methods, such as ComBat and CovBat, DeepComBat assumes batch effects can be estimated as differences in feature-wise conditional means and variances of unmodeled residuals.However, unlike statistical methods, mean batch effect are estimated nonlinearly and multivariately using a combination of batch and biological covariates along with subject-specific latent space representations.In this mean batch effect estimation procedure, mean batch effect are partially removed by the decoder in a nonparametric manner, where the only assumption on the nature of batch effect is that it can be approximated by the decoder network.Thus, while latent space harmonization still involves the ComBat model, overall harmonization may be less contingent on how well the data follow ComBat assumptions.Additionally, discussed further below, moment-matching of latent space representations has been empirically shown to be effective in various harmonization-like tasks (Fatania et al., 2022;Huang & Belongie, 2017;Lopez et al., 2018;Lotfollahi et al., 2019;Zuo et al., 2021).
In However, the first assumption is violated in situations where the latent-space dimensions are too small to adequately capture nonbatch information, the decoder is not complex enough to efficiently encode all batch-related information, and sample sizes within batches are too limited to estimate all the necessary network parameters.
These violations are further compounded when the first two assumptions are considered together, while near-perfect model fit may be achievable with a large latent space, it is even more challenging when a completely batch-invariant latent space is required.Finally, in neuroimaging datasets where biological covariates are imbalanced across batches, such as in the ADNI dataset used in this study, complete removal of marginal batch-wise differences will necessarily involve removal of biological information as well.
DeepComBat is able to relax these strong implicit assumptions by (1) accounting for the presence of reconstruction residuals and reintroducing them on top of the CVAE-harmonized subject-level means, (2) explicitly removing batch effects from the CVAE latent space, and (3) conditioning on biological covariates at each harmonization step.
By relaxing these assumptions, we are able to greatly improve the usability of DeepComBat by simplifying its architecture when compared to that of dcVAE and gcVAE.For example, dcVAE and gcVAE rely on adversarial training with a discriminator in order to train their decoders to produce more realistic outputs, but DeepComBat no longer needs this adversarial component since non-perfect model fit is acceptable.This minimizes computational burden and avoids common challenges in adversarial training.DeepComBat also circumvents the need for precise tuning of the KL divergence weighting hyperparameter, since remaining batch effects in the latent space are explicitly removed after CVAE training.
Importantly, relaxing these assumptions allows for DeepComBat to be designed such that, if a subject-level feature vector is "self-harmonized" back to its actual batch, that feature vector will be unchanged.This makes sense, since "self-harmonization" should be the identity function.However, in other deep learning harmonization methods, including dcVAE and gcVAE, since reconstruction residuals are not explicitly accounted for in these other methods, the "selfharmonized" data will have less noise.This phenomena has been highlighted by Dewey et al. (2019) Specifically, in scRNA-seq batch effects correction, scGen encodes gene expression data to a latent space using a standard variational autoencoder (Lotfollahi et al., 2019).Then, the algorithm estimates and removes mean batch effects, or first moments, from this latent space, conditional on cell type, in order to perform correction.CVAE-based methods such as dcVAE, gcVAE, and a number of scRNA-seq methods, such as scVI, can also be thought to perform latent-space moment-matching (Lopez et al., 2018); however, these methods do so implicitly through the loss function, rather than by explicitly estimating and correcting latent space coordinates.
Additionally, in image style transfer, where the goal is to change the style of an image without changing its content, adaptive instance normalization (AdaIN) can be used along with a convolutional autoencoder and its variations (Huang & Belongie, 2017).In the convolu-  variants, may be a more reasonable choice (Hu et al., 2023).

| CONCLUSION
DeepComBat to cortical thickness measurements acquired through the Alzheimer's Disease Neuroimaging Initiative (ADNI) and compare our results to those of other feature-level harmonization methods where open-source code was available, namely: ComBat, CovBat, dcVAE (modified for non-diffusion setting), and gcVAE.Compared to other methods, DeepComBat-harmonized data retain biological information of interest while containing less batch information.Our results demonstrate the advantage of incorporating statistical ideas into deep learning methods to perform multivariate harmonization.
multivariate inference.Meanwhile, external harmonization is useful in settings where a subset of the dataset is available for training harmonization methods and prediction models, and future out-of-sample data needs to be brought into the dataset.Internal and external Deep-ComBat are nearly identical, with the exception of how CVAE parameters are optimized, described below.For internal harmonization, all 663 subjects were included for ComBat, CovBat, and DeepComBat training and harmonization.For external harmonization, the DeepComBat-harmonized dataset was T A B L E 1 Patient demographics at time of acquisition, stratified by batch.
and λ is a hyperparameter to weight the relative importance of the two components.The KL divergence component measures the difference between f Z ij À Á , which is the probability density function of the multivariate normal latent space distribution for subject

F
I G U R E 1 Top: DeepComBat conditional variational autoencoder (CVAE) architecture and loss functions used during training.Bottom: DeepComBat CVAE algorithm used during the harmonization step.At this step, encoder and decoder parameters have been learned during the training step and are frozen.Notation corresponds to that in the main text.

λ
¼ 0, then perform cyclic annealing over 30 epochs where one cycle is 5 epochs and λ increases linearly from 0 to λ Final within each cycle, and finally train the CVAE for 5 epochs with the desired λ ¼ λ Final .This cyclic annealing procedure is computationally helpful to improve convergence and is performed in a fully automated manner-users are only required to specify the desired λ Final .Compared to a more straightforward constant-schedule training schedule where the CVAE is trained using a constant λ ¼ λ Final for all 40 total epochs, the cyclic annealing schedule leads to significantly lower final overall and reconstruction losses as well as significantly higher final prior losses when λ Final is large (Supplemental Figure1, p < .05).Thus, cyclic schedule training may allow for better estimation of subject-specific multivariate means using subject-specific latent space factors that are independent of batch and biological covariates.Additionally, when λ Final is large, the variances of final reconstruction and prior losses appear qualitatively larger under the constant schedule compared to the cyclic schedule.This suggests cyclic schedule training may be more consistent compared to constant schedule training when trade-offs between reconstruction loss and prior loss are high.We present results using cyclic annealing in the main text and provide additional results using both cyclic annealing and constant scheduling for various λ Final in the Supplemental Materials.
Note that, in contrast to similar CVAE-based harmonization methods like dcVAE, gcVAE, and a number of image-based methods which require a KL divergence component hyperparameter such that latent space distributions are independent of batch, the Deep-ComBat λ Final is instead only used to regularize the latent space and reduce the amount of batch information in the latent space, if possible.However, substantial remaining batch information in the Deep-ComBat latent space is allowed, which enables easier hyperparameter tuning.
metric permutation-based test developed and validated in the context of detecting batch effects in single-cell RNA-sequencing (scRNA-seq) that (1) randomly samples a proportion of observations, (2) identifies each observation's k-nearest neighbors (KNNs), (3) evaluates whether the local distribution of batch among each set of KNNs differs from the global distribution of batch, and (4) generates an overall kBET statistic evaluating whether the number of observations with large differences in local distribution of batch are greater than that expected.
mitigated by ComBat, CovBat, and DeepComBat.Similar results were observed in visualizations of the multivariate feature distribution using the first two principal components and the first two UMAP dimensions(Figures 2b,c and 3b,c).Finally, we explored how various harmonization methods change the raw data at the feature level.Here, we randomly sampled 10 cortical thickness features and randomly sampled 100 subjects to obtain a total of 1000 randomly sampled cortical thickness values.For each harmonization method, we plotted harmonized values for these cortical thicknesses against their corresponding raw values in Figures2d and 3d.In this visualization, ComBat and CovBat seemed to mostly induce linear shifts in the data, consistent with their underlying shift and scale models.DeepComBat induced small nonlinear shifts in the data on a similar scale as ComBat and CovBat.ComBat, CovBat, and DeepComBat performed similarly in the two-batch setting, while dcVAE and gcVAE showed substantial transformation of the raw data (Supplemental Figure2).Additionally, dcVAE and gcVAE mapped harmonized values to their corresponding CVAE-predicted mean values without accounting for unmodeled CVAE reconstruction errors (Supplemental Figure2d).Thus, dcVAE and gcVAE produced outputs with noise patterns characteristic of synthetic data, as noted byDewey et al. (2019).Overall, qualitative visualizations showed DeepComBat effectively matched univariate feature distributions across batches, preserved the underlying multivariate structure of the data, estimated harmonized values that were highly correlated with the corresponding raw values, and avoided introduction of synthetic artifacts.
. In these figures, Com-Bat, CovBat, and DeepComBat show p-value distributions qualitatively similar to a uniform distribution, while raw data showed p-value distributions with more highly significant p-values than expected under a uniform distribution.Finally, when assessed via kBET, CovBat and DeepComBat both produced harmonized outputs where the distributions of batch within local neighborhoods were not significantly different from the global distribution, though DeepComBat showed a slight advantage in both internal and external harmonization (Table Overall, DeepComBat removed statistically detectable batch effects and preserved biological information without introducing detectable bias.In univariate analyses, DeepComBat performed similarly to ComBat and CovBat, while in multivariate analyses, Deep-ComBat outperformed ComBat and CovBat.DeepComBat outperformed dcVAE and gcVAE by all metrics.F I G U R E 2 Internal harmonization qualitative visualizations.In panels a-c, red corresponds to Siemens, green corresponds to General Electric (GE), and blue corresponds to Philips.(a) Density plots of one randomly sampled feature.(b) Principal component analysis (PCA) plots where PCA ellipses denote major and minor axes for each batch, centered at the batch-wise mean.(c) Unifold Manifold Approximation and Projection (UMAP) plots.(d) Randomly sampled harmonized values plotted against their corresponding raw values.Colors indicate each of the 10 randomly sampled cortical thickness features.
and 5, DeepComBat-harmonized data showed predictive performances similar to those of raw, ComBatcorrected, and CovBat-corrected data.In the case of age, removal of batch effects using DeepComBat increased the detectability of agerelated information in the resulting data.In two-batch harmonization, DeepComBat-harmonized data contained less batch information that data harmonized by other methods, and dcVAE and gcVAE removed less batch effects compared to Com-Bat, CovBat, and DeepComBat (Supplemental Figure6).All postharmonization predictive performances for biological covariates under ComBat, CovBat, and DeepComBat were significantly higher than those of dcVAE and gcVAE, suggesting that ignoring CVAE reconstruction residuals may be detrimental to preservation of non-batch information.Overall, we found that DeepComBat more effectively removed multivariate batch effects than other harmonization methods, even when assessed via powerful ML algorithms such as XGBoost.Additionally, DeepComBat effectively preserved biological information in the predictive context.3.4 | DeepComBat is robust to hyperparameter choice and training scheduleEmpirically, DeepComBat still performed effective harmonization over a range of λ Final and using either cyclic annealing or constant training F I G U R E 3 External harmonization qualitative visualizations.In panels a-c, red corresponds to Siemens, green corresponds to General Electric (GE), and blue corresponds to Philips.(a) Density plots of one randomly sampled feature.(b) Principal component analysis (PCA) plots where PCA ellipses denote major and minor axes for each batch, centered at the batch-wise mean.(c) Unifold Manifold Approximation and Projection (UMAP) plots.(d) Randomly sampled harmonized values plotted against their corresponding raw values.Colors indicate each of the 10 randomly sampled cortical thickness features.schedule.The above analyses were run on DeepComBat internal harmonization outputs under a wide range of hyperparameter values.
When compared to other methods, we show DeepComBat performs more effectively when evaluated by highly multivariate ML experiments as well as nonparametric kBET testing.It performs comparably to ComBat and CovBat when evaluated by statistical tests.Meanwhile, DeepComBat had comparable predictive performance for sex, diagnosis, and age when compared to ComBat and CovBat, but superior performance when compared to dcVAE and gcVAE in the two-batch setting.While increased biological effect sizes after harmonization may be desirable, the way in which biological effect sizes change after harmonization are known to depend on the ground truth F I G U R E 4 Internal harmonization machine learning results.Validation set performance metrics are shown for 10 repeats of 10-fold cross validation.Error bars correspond to 95% confidence intervals.Dashed red lines display expected performance of a weighted random classifier.(a) Average accuracy for predicting batch.Lower is better.(b) Average area under the receiver operating characteristic curve (AUROC) for predicting sex.(c) Average accuracy for predicting Alzheimer disease status.(d) Average R 2 value for predicting age.

of λ
Final lead to slightly superior DeepComBat performance under both cyclic annealing and constant scheduling.This suggests more extensive tuning of λ Final for each new dataset could lead to superior results; however, increasing λ Final by too much will lead to elimination of all subject-specific information from the CVAE latent space and forego the benefits of DeepComBat over standard ComBat.Thus, we choose to use a relatively small λ Final of 0.1 by default.This choice allows for adequate latent space regularization with minimal risk of overweighting the prior loss.Similarly, we fit DeepComBat using F I G U R E 5 External harmonization machine learning results.Validation set performance metrics are shown for 10 repeats of 10-fold cross validation.Error bars correspond to 95% confidence intervals.Dashed red lines display expected performance of a weighted random classifier.(a) Average accuracy for predicting batch.Lower is better.(b) Average area under the receiver operating characteristic curve (AUROC) for predicting sex.(c) Average accuracy for predicting Alzheimer disease status.(d) Average R 2 value for predicting age.cyclic annealing by default despite comparable performance under constant scheduling-since cyclic annealing allows for better stability with respect to KL vanishing when λ Final is high, cyclic annealing may provide better fitting in certain datasets.To provide flexibility, we provide functionality for manual selection of λ Final as well as training schedule in the DeepComBat R package.
terms of correcting batch effect in unmodeled residuals, Deep-ComBat argues a meaningful portion of what statistical methods claims are "unmodeled residuals"-information that is not explained by biological nor batch covariates by the naive linear model-can in fact be explained as a multivariate nonlinear function of biological covariates, batch covariates, and subject-specific latent factors.Through the CVAE architecture, DeepComBat is able to significantly reduce the mean squared error between model-predicted feature vectors and raw feature vectors when compared to ComBat and CovBat.Thus, DeepComBat is able to directly model and correct more batch effect in terms of conditional differences in mean, and less batch effect is corrected based on the strong assumption that there are batch-wise differences in the variances of unmodeled residuals.Subsequently, although DeepComBat still uses the ComBat model to correct the residuals, it may rely less on ComBat-specific assumptions since the magnitude of batch effect correction on the residuals is smaller.4.3 | DeepComBat relaxes strong assumptions made by other deep learning methods and simplifies model fittingPrevious feature-level deep learning harmonization methods, including dcVAE and gcVAE make a number of strong implicit assumptions.These assumptions include (1) perfect model fit, which assumes that reconstruction residuals insignificant and therefore do not need to accounted for or reincorporated, (2) fully disentangleable latent space, which assumes that the neural network can completely learn a batchinvariant latent space based on the loss function alone, and (3) balanced biological covariates across batches, which assumes that all population-level differences across batch are in fact due to batch and should be removed.
tional autoencoder, images are encoded into a set of latent space convolutional filters.Then, AdaIN performs style transfer by matching the means and variances of each filter, learned from the original image, to the means and variances of the corresponding filters learned from the image that has the desired style.In the non-convolutional setting of DeepComBat, each 1 Â 1 element of the latent space vector corresponds to one convolutional filter, and similar moment-matching is performed, but at the group level instead of the individual input level.Finally, outside of deep learning methods, DeepComBat draws on ideas from CovBat, which has been shown to harmonize the mean and covariance across sites.CovBat first performs standard ComBat and then corrects the covariance structure of residuals by projecting them into a latent space defined by principal components and running ComBat again.Thus, CovBat performs univariate mean harmonization and linearly multivariate residual harmonization.DeepComBat flips these steps-it first performs nonlinear multivariate mean harmonization and then univariately corrects the reconstruction residuals.Notably, DeepComBat autoencoder residuals are much smaller in magnitude than CovBat linear model residuals, so univariate residual correction is sufficient.

4. 5
| LimitationsWe show DeepComBat to be a promising tool for multivariate, deep learning harmonization.However, more work must be done to continue characterizing and extending DeepComBat.First, DeepComBat performance should be validated in other forms of high-dimensional, large neuroimaging datasets with highly correlated features.Examples of such datasets include voxel-wise images, diffusion imaging features, as well as functional connectivity and other network features.Additionally, while DeepComBat is statistically rigorous and robust to hyperparameter choice, training schedules, and both internal and external harmonization settings, DeepComBat still contains black box deep learning elements which may limit utility in settings where traceability is essential, such as in clinical trials.Finally, DeepComBat is designed for harmonization of cross-sectional studies where only one batch variable is present and cannot be applied to longitudinal datasets or datasets where two or more variables introduce technical noise.In clinical or non-cross-sectional datasets, well-characterized methods with analytical solutions, such as standard ComBat and its DeepComBat is a novel, statistically rigorous, deep learning approach to image harmonization that leverages deep learning and statistical concepts to perform multivariate batch effects correction conditional on biological covariates.We demonstrate it can more effectively remove multivariate batch effects from structural neuroimaging feature while preserving biological information than previously proposed methods.Additionally, DeepComBat proposes marked innovations over previous deep learning harmonization methods, allowing for conditioning on covariates, preservation of raw data characteristics, harmonization of more than two batches, robustness to choice of hyperparameters, and external harmonization capabilities.As highdimensional, multi-batch data becomes more common and interest in using ML techniques to analyze such data grows, we hope that Deep-ComBat will serve as a tool for end-users to remove multivariate batch confounding.Additionally, we hope the statistically motivated design of DeepComBat provides a new perspective for methodologists to continue developing improved deep learning harmonization methods.AUTHOR CONTRIBUTION STATEMENT Fengling Hu: Conceptualization, methodology, software, validation, formal analysis, investigation, writing-original draft, writing-review and editing, visualization.Alfredo Lucas: Software, validation, investigation, visualization, writing-review and editing.Andrew A. Chen: Methodology, software, validation.Kyle Coleman: Methodology, software.Hannah Horng: Validation.Raymond W.S. Ng: Software, investigation.Nicholas J. Tustison: Resources, data curation.Haochang Shou: Methodology, investigation, writing-review and editing.Kathryn A. Davis: Resources, funding acquisition.Mingyao Li: T A B L E 2 Parametric statistical testing results, reported as negative log 10 p-values.Negative log 10 of conventional p-value threshold 0.05 is 1.30.Larger is more significant.
ComBat in the CVAE latent space in order to generate a batchinvariant latent space.In this step, ComBat is used as a momentmatching model that takes advantage of shrinkage estimation in order to match conditional means and variances across batches.Analogies between latent-space ComBat and other moment-matching style transfer algorithms can be drawn.
, who noticed that DeepHarmony, an image-level harmonization method, produced harmonized images